19 research outputs found
MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge
Cascade systems comprise a two-model sequence, with a lightweight model
processing all samples and a heavier, higher-accuracy model conditionally
refining harder samples to improve accuracy. By placing the light model on the
device side and the heavy model on a server, model cascades constitute a widely
used distributed inference approach. With the rapid expansion of intelligent
indoor environments, such as smart homes, the new setting of Multi-Device
Cascade is emerging where multiple and diverse devices are to simultaneously
use a shared heavy model on the same server, typically located within or close
to the consumer environment. This work presents MultiTASC, a
multi-tenancy-aware scheduler that adaptively controls the forwarding decision
functions of the devices in order to maximize the system throughput, while
sustaining high accuracy and low latency. By explicitly considering device
heterogeneity, our scheduler improves the latency service-level objective (SLO)
satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade
methods in highly heterogeneous setups, while serving over 40 devices,
showcasing its scalability.Comment: Accepted at 28th IEEE Symposium on Computers and Communications
(ISCC), 202
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Exploring the Performance and Efficiency of Transformer Models for NLP on Mobile Devices
Deep learning (DL) is characterised by its dynamic nature, with new deep
neural network (DNN) architectures and approaches emerging every few years,
driving the field's advancement. At the same time, the ever-increasing use of
mobile devices (MDs) has resulted in a surge of DNN-based mobile applications.
Although traditional architectures, like CNNs and RNNs, have been successfully
integrated into MDs, this is not the case for Transformers, a relatively new
model family that has achieved new levels of accuracy across AI tasks, but
poses significant computational challenges. In this work, we aim to make steps
towards bridging this gap by examining the current state of Transformers'
on-device execution. To this end, we construct a benchmark of representative
models and thoroughly evaluate their performance across MDs with different
computational capabilities. Our experimental results show that Transformers are
not accelerator-friendly and indicate the need for software and hardware
optimisations to achieve efficient deployment.Comment: Accepted at the 3rd IEEE International Workshop on Distributed
Intelligent Systems (DistInSys), 202
Approximate FPGA-based LSTMs under Computation Time Constraints
Recurrent Neural Networks and in particular Long Short-Term Memory (LSTM)
networks have demonstrated state-of-the-art accuracy in several emerging
Artificial Intelligence tasks. However, the models are becoming increasingly
demanding in terms of computational and memory load. Emerging latency-sensitive
applications including mobile robots and autonomous vehicles often operate
under stringent computation time constraints. In this paper, we address the
challenge of deploying computationally demanding LSTMs at a constrained time
budget by introducing an approximate computing scheme that combines iterative
low-rank compression and pruning, along with a novel FPGA-based LSTM
architecture. Combined in an end-to-end framework, the approximation method's
parameters are optimised and the architecture is configured to address the
problem of high-performance LSTM execution in time-constrained applications.
Quantitative evaluation on a real-life image captioning application indicates
that the proposed methods required up to 6.5x less time to achieve the same
application-level accuracy compared to a baseline method, while achieving an
average of 25x higher accuracy under the same computation time constraints.Comment: Accepted at the 14th International Symposium in Applied
Reconfigurable Computing (ARC) 201
EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices
In recent years, advances in deep learning have resulted in unprecedented
leaps in diverse tasks spanning from speech and object recognition to context
awareness and health monitoring. As a result, an increasing number of
AI-enabled applications are being developed targeting ubiquitous and mobile
devices. While deep neural networks (DNNs) are getting bigger and more complex,
they also impose a heavy computational and energy burden on the host devices,
which has led to the integration of various specialized processors in commodity
devices. Given the broad range of competing DNN architectures and the
heterogeneity of the target hardware, there is an emerging need to understand
the compatibility between DNN-platform pairs and the expected performance
benefits on each platform. This work attempts to demystify this landscape by
systematically evaluating a collection of state-of-the-art DNNs on a wide
variety of commodity devices. In this respect, we identify potential
bottlenecks in each architecture and provide important guidelines that can
assist the community in the co-design of more efficient DNNs and accelerators.Comment: Accepted at MobiSys 2019: 3rd International Workshop on Embedded and
Mobile Deep Learning (EMDL), 201
Multi-Exit Semantic Segmentation Networks
Semantic segmentation arises as the backbone of many vision systems, spanning
from self-driving cars and robot navigation to augmented reality and
teleconferencing. Frequently operating under stringent latency constraints
within a limited resource envelope, optimising for efficient execution becomes
important. At the same time, the heterogeneous capabilities of the target
platforms and the diverse constraints of different applications require the
design and training of multiple target-specific segmentation models, leading to
excessive maintenance costs. To this end, we propose a framework for converting
state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS)
networks: specially trained models that employ parametrised early exits along
their depth to i) dynamically save computation during inference on easier
samples and ii) save training and maintenance cost by offering a post-training
customisable speed-accuracy trade-off. Designing and training such networks
naively can hurt performance. Thus, we propose a novel two-staged training
scheme for multi-exit networks. Furthermore, the parametrisation of MESS
enables co-optimising the number, placement and architecture of the attached
segmentation heads along with the exit policy, upon deployment via exhaustive
search in <1 GPUh. This allows MESS to rapidly adapt to the device capabilities
and application requirements for each target use-case, offering a
train-once-deploy-everywhere solution. MESS variants achieve latency gains of
up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same
computational budget, compared to the original backbone network. Lastly, MESS
delivers orders of magnitude faster architectural customisation, compared to
state-of-the-art techniques.Comment: (Extended version) Accepted at ECCV 202